Report

Applied Statistics course HdM 2023

1 Setup

Code
import pandas as pd
import numpy as np
import missingno as mno # needed to visualize missing values. install missingno into conda if import does not work!
import altair as alt
import matplotlib.pyplot as plt
import xlrd # needed to read excel files. install xlrd into conda if import does not work!
import shutil # needed to copy files
import time
import datetime
import warnings
import joblib
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import RidgeCV
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

warnings.simplefilter(action='ignore', category=FutureWarning)
alt.data_transformers.disable_max_rows()

2 Introduction and data

The aim of this project is to investigate whether there is a correlation between household income and the death rate in the United States of America. To explore this relation, we gathered data on both topics and will analyse how and to what extent the death rate is affected by household income.

2.1 Research Question

Does household income have an impact on death rates in the U.S., and if so, how large is it?

Our hypothesis regarding the research question is:

The household income and the death rate will have a negative correlation.

That is, the higher the household income, the lower the death rate.

The predictor variable will be the median household income. The main response variable will be the age-adjusted death rate. Further insights can be gained by using categories such as death cause, state or year. Additional useful information is provided by the total death counts. The data dictionary below shows more details about the required variables.

You can check the appendix for additional information regarding the research question.

Name Description Role Type Format
0 state the U.S. state where data was collected predictor nominal category
1 year considered years (1999 - 2017) predictor numeric discrete date
2 median_household_income median household income predictor numeric continuous float
3 cause name the generic name for the death cause predictor nominal category
4 113 cause name NDI ICD-10 113 categories for causes of death Not used nominal category
5 deaths count of the total deaths Not used numeric discrete int
6 Age-adjusted Death Rate standardized ratio of deaths per 100k population response numeric continuous float

2.2 Data

2.2.1 Import data

We import the original data from the raw folder and convert the .xls file to a .csv file. We also declare our dataframes.

Code
# Declare variables
external_data = '..\\data\\external\\'
raw_data = '..\\data\\raw\\'
interim_data = '..\\data\\interim\\'
processed_data = '..\\data\\processed\\'
# File names
orig_income_file = 'Median_Household_Income_By_State_1990-2017.xls'
target_income_file = 'Median_Household_Income_By_State_1990-2017.csv'
orig_death_file = 'NCHS_-_Leading_Causes_of_Death__United_States.csv'
# Save external median income data as csv in folder 'raw'
# Read file
xls_household_file = pd.read_excel(external_data+orig_income_file)
# Save file
xls_household_file.to_csv(raw_data+target_income_file,index = None, header=True)
# Copy external leading cause of death data into folder 'raw'
shutil.copy(external_data+orig_death_file, raw_data+orig_death_file)
# Declare both dataframes
df_income = pd.read_csv(raw_data+target_income_file)
df_death = pd.read_csv(raw_data+orig_death_file)

2.2.2 Data structure

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10868 entries, 0 to 10867
Data columns (total 6 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Year                     10868 non-null  int64  
 1   113 Cause Name           10868 non-null  object 
 2   Cause Name               10868 non-null  object 
 3   State                    10868 non-null  object 
 4   Deaths                   10868 non-null  int64  
 5   Age-adjusted Death Rate  10868 non-null  float64
dtypes: float64(1), int64(2), object(3)
memory usage: 509.6+ KB

In the death dataset we have 10868 cases and 6 columns. The income dataset looks like this:

Table 102.30. Median household income, by state: Selected years, 1990 through 2017 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 ... Unnamed: 22 Unnamed: 23 Unnamed: 24 Unnamed: 25 Unnamed: 26 Unnamed: 27 Unnamed: 28 Unnamed: 29 Unnamed: 30 Unnamed: 31

0 rows × 32 columns

2.2.3 Data corrections

Income Dataset

From the overview, we have seen that the df_income dataset needs to be cleaned. You can follow the required steps in the code section below.

Code
# We only need the first 16 columns 
# and we can also drop the columns 4,6,8,10,12 and 14 since they only show NaN values
column_lst = [0,1,2,3,5,7,9,11,13,15]
df_income_corrected = df_income[df_income.columns[column_lst]]
# We don't need rows 0, 2 and 64-67
row_drop_lst = [0,2,64,65,66,67]
df_income_corrected = df_income_corrected.drop(row_drop_lst)
# There are a few rows with empty entries left, which we can get rid of
# since those rows have no state assigned to them; they are only used as separators
df_income_corrected.dropna(inplace=True)
# The column names are actually in the first row, additionally they need to be adjusted
column_names = ['State','1990','2000','2005','2010','2013','2014','2015','2016','2017']
df_income_corrected.columns = column_names
# Row number 1 can be dropped
df_income_corrected.drop(1,inplace=True)
# The dots in the state column can be removed
# We also do not want any leading or ending spaces in the strings
df_income_corrected = df_income_corrected.replace(r'\.','',regex=True)
df_income_corrected['State'] = df_income_corrected['State'].str.strip()
# Lastly we reset the row index and drop the old index
df_income_corrected.reset_index(drop = True, inplace = True)
# We transform the table to show the median income for a state in a single year
# We use the melt() function for this
lst_years = column_names[1:]
df_income_corrected = df_income_corrected.melt(
    id_vars= ['State'], 
    value_vars= lst_years, 
    var_name= 'year', 
    value_name= 'median_household_income' 
    )
# The State column should be lowercase
df_income_corrected = df_income_corrected.rename(
    columns = {'State':'state'}
)
# The types need to be declared: state holds categorical values, year holds integers and income holds floats
df_income_corrected['state'] = df_income_corrected['state'].astype('category')
df_income_corrected['year'] = df_income_corrected['year'].astype('int')
df_income_corrected['median_household_income'] = df_income_corrected['median_household_income'].astype('float')
state year median_household_income
0 United States 1990 57500.0
1 Alabama 1990 45200.0

Death Dataset

From the overview of the death dataset, we also need to correct the data. You can follow the required steps in the code section below.

Code
# Make a copy to perform corrections on
df_death_corrected = df_death.copy()
# Change column names to lowercase
df_death_corrected.columns = df_death_corrected.columns.str.lower()
# Change spaces and the '-' to underscores
df_death_corrected.columns = df_death_corrected.columns.str.replace(r' ','_',regex=True)
df_death_corrected.columns = df_death_corrected.columns.str.replace(r'-','_',regex=True)
# Remove any leading or trailing whitespaces for the object columns
df_death_corrected['113_cause_name'] = df_death_corrected['113_cause_name'].str.strip()
df_death_corrected['cause_name'] = df_death_corrected['cause_name'].str.strip()
df_death_corrected['state'] = df_death_corrected['state'].str.strip()
# Change Dtype of 113 Cause Name, Cause Name and State column to category
cols = df_death_corrected.select_dtypes(include='object').columns.to_list()
df_death_corrected[cols] = df_death_corrected[cols].astype('category')
file = 'Corrected_NCHS_-_Leading_Causes_of_Death__United_States.csv'
df_death_corrected.to_csv(interim_data+file,index = None, header=True)

Joined Dataset

Now, we are able to add the median household income from the income dataset to the death dataset by adding the corresponding value to the correct year and state present in the death dataset.

Code
# Merge both dataframes
df_joined = pd.merge(
    df_death_corrected,
    df_income_corrected,
    how = 'left',
    on = ['year','state']
)

# Save the joined dataset
file = 'Joined_Dataset.csv'
df_joined.to_csv(processed_data+file,index = None, header=True)

We take a look at the joined dataset to see if we need to do anything before using it:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10868 entries, 0 to 10867
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype   
---  ------                   --------------  -----   
 0   year                     10868 non-null  int64   
 1   113_cause_name           10868 non-null  category
 2   cause_name               10868 non-null  category
 3   state                    10868 non-null  category
 4   deaths                   10868 non-null  int64   
 5   age_adjusted_death_rate  10868 non-null  float64 
 6   median_household_income  4576 non-null   float64 
dtypes: category(3), float64(2), int64(2)
memory usage: 459.6 KB

Imputed Dataset

We will impute the missing values using KNN imputation, since that typically yields a good imputation for numerical values. The data needs to be scaled in order for the algorithm to perform well.

After the imputation, we’ll have to use the inverse_transform() function from MinMaxScaler to bring the scaled dataset back in the original form. In the appendix, you can find the imputation analysis and statistics in detail.

Code
# Imputation
# Only keep numerical columns
col_num = df_joined.select_dtypes(include=[np.number]).columns.to_list()

# Original household median income 
original_df_joined = df_joined[col_num]

# Scaled household median income
scaler = MinMaxScaler()
scaled_df_joined = scaler.fit_transform(original_df_joined)
scaled_df_joined = pd.DataFrame(data=scaled_df_joined, columns=original_df_joined.columns)

# Impute
imputer_scaled = KNNImputer(n_neighbors=1)
imputed_scaled = imputer_scaled.fit_transform(scaled_df_joined)

# Convert to DataFrames
imputed_scaled = pd.DataFrame(data=imputed_scaled, columns=original_df_joined.columns)

# Inverse the scaling
imputed_scaled = scaler.inverse_transform(imputed_scaled)
imputed_scaled = pd.DataFrame(data=imputed_scaled, columns=original_df_joined.columns)

2.2.4 Variable lists

The detailed list of variables can be found in the data dictionary.

Code
# define outcome variable as y_label
y_label = 'age_adjusted_death_rate'
# select features
features = df_joined.drop(columns=[y_label]).columns.tolist()
# create feature data for data splitting
X = df_joined[features]
# create response for data splitting
y = df_joined[y_label]

2.2.5 Data splitting

# Data Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# data training set
df_train = pd.DataFrame(X_train.copy())
df_train = df_train.join(pd.DataFrame(y_train))

3 Analysis

We will focus on the correlation between the median income and the age-adjusted death rate over all death causes summarized. Although we suspect the death cause, year and the state to be indicators for variation on a more detailed level, this analysis would be beyond the scope of this project.

In order to test our hypothesis we will inspect summary statistics and use different visualizations to understand the relations between the predictor and response variables and to gain further knowledge. First we copy the training dataframe. This prevents the training data from being changed during data exploration:

df_explore = df_train.copy()

3.1 Descriptive statistics

Let’s take a look at the summary statistics for the exploration dataframe:

count mean std min 25% 50% 75% max
year 7607.0 2007.97 5.47 1999.0 2003.0 2008.0 2013.0 2017.0
deaths 7607.0 14941.20 108968.33 21.0 616.0 1734.0 5802.5 2813503.0
median_household_income 7607.0 57785.01 9368.38 40000.0 50400.0 55800.0 63400.0 82400.0
age_adjusted_death_rate 7607.0 126.77 221.84 2.6 19.2 36.0 153.2 1061.2

We also look at the IQR for the features:

year                          10.0
deaths                      5186.5
median_household_income    13000.0
age_adjusted_death_rate      134.0
dtype: float64
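The summary table and IQR values above were computed on df_explore; as a minimal sketch of the computation (on made-up values, since the training data is not embedded here):

```python
import pandas as pd

# Hypothetical stand-in for df_explore; the report computes these
# numbers on the actual training split.
df_explore = pd.DataFrame({
    "median_household_income": [40000.0, 50400.0, 55800.0, 63400.0, 82400.0],
    "age_adjusted_death_rate": [2.6, 19.2, 36.0, 153.2, 1061.2],
})

# Summary statistics, transposed as in the table above
summary = df_explore.describe().round(2).T

# Interquartile range: Q3 - Q1, per numeric column
iqr = df_explore.quantile(0.75) - df_explore.quantile(0.25)
print(summary)
print(iqr)
```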

We will further explore the data in the next segment and provide a detailed interpretation for the age adjusted death rate as well as the median household income.

3.2 Exploratory data analysis

The distribution for the age adjusted death rate looks like this:

(Plot: distribution of the age-adjusted death rate)

The statistic and distribution for the age adjusted death rate show:

  • 75 % of the values are at or below 153.2 while the maximum goes up to 1061.2. With an IQR of 134, this is a heavily right-skewed distribution with a lot of outliers.
  • The distribution visualizes this effect but does not explain it yet. It also shows that it is bimodal.
  • The standard deviation shows that the values differ a lot from the mean, which can also be explained by the skew.

Our first interpretation is that the cause_name column contains summarized values across every cause of death (stored under the category ‘All causes’), which could lead to the effect seen in the table and visualization above.

We need to take a more detailed look to correctly interpret this distribution:

Here we can clearly see the reason for the distribution: the cause_name feature stores individual causes as well as a summary over all causes within the same column. We will make another dataframe filtered by ‘All causes’ to serve as a summary dataset:

df_explore_causes_summarized = df_explore[df_explore.cause_name == 'All causes'].copy()

Let’s look at the statistics and distribution of the age-adjusted death rate again:

count mean std min 25% 50% 75% max
age_adjusted_death_rate 675.0 800.13 98.12 572.0 724.4 784.5 869.5 1061.2
(Plot: distribution of the age-adjusted death rate, ‘All causes’ only)

We can see that the distribution has changed:

  • The distribution is unimodal and still right skewed, but a lot closer to a normal distribution.
  • We no longer observe the wide range of values from before; there are no more outliers present.

Next we visualize the distribution for the median household income:

(Plot: distribution of the median household income)

The summary statistics for the median household income in combination with the distribution show:

  • The distribution is unimodal and right skewed.
  • The median is found at 56350 $ and, with an IQR of 13000, there are no outliers present in the income data.

The imputed data is very close to the original data present in the income dataset. A comparison for the distribution before and after the imputation can be found in the appendix.

To visualize the relation between age-adjusted death rate and median household income, we analyze scatterplots:

There seems to be a moderate to strong negative correlation between the two variables. We suspect the correlation to not be linear which we will investigate later.

Additional data exploration can be found in the appendix.

3.3 Relationships

We take a look at the correlation for the data filtered by ‘All causes’:

Code
# inspect correlation for summarized cause name again
corr = df_explore_causes_summarized.corr()
corr.style.background_gradient(cmap='Blues')
  year deaths median_household_income age_adjusted_death_rate
year 1.000000 0.033989 -0.078670 -0.446958
deaths 0.033989 1.000000 0.005858 -0.045224
median_household_income -0.078670 0.005858 1.000000 -0.512797
age_adjusted_death_rate -0.446958 -0.045224 -0.512797 1.000000

The correlation matrix shows that there is a moderate to strong negative correlation between the age-adjusted death rate and median_household_income. That means our hypothesis is supported by the data. The data additionally shows that the more years pass, the fewer people die (looking at data from 1999-2017), which could be explained by improvements in the medical sector as well as in the standard of living. To verify this assumption we would need further analysis and possibly more data, but this is not within the scope of our project. No other significant correlation is visible.

Additional analysis of relations can be found in the appendix.

4 Methodology

4.1 Model

Since the predictor and the response variable are numeric and we try to find a pattern between them, we have a regression problem. We will start with simple linear regression even though we do not assume a linear relationship. In addition we use lasso regression and polynomial regression and compare the models to see which performs better.

We will only use the category ‘All causes’ and filter the death rates by it, because we focus on the total death rate in relation to the median household income.

Before we start, we encode the categorical features as numeric features for the models and perform the data splitting again since we made changes. We also drop the year column, since we are not evaluating a time series, and the deaths column, since it is already factored into the death rate.

Code
# drop 113_cause_name column
df_joined.drop(['113_cause_name', 'year'], axis = 1, inplace = True)

# Filter only for cause_name = 'all_causes'
df_joined_summarized = df_joined[df_joined.cause_name == 'All causes'].copy()

# Drop cause_name column since it only contains the same entry,
# and deaths column since it is already factored into the death rate
df_joined_summarized.drop(['cause_name', 'deaths'], axis = 1, inplace = True)

# Reset index to fit new row number
df_joined_summarized.reset_index(drop = True, inplace = True)

# identify relevant numerical variables
list_num = df_joined_summarized.select_dtypes(include=[np.number]).columns.tolist()
list_num.remove(y_label)

#identify all categorical variables
list_cat = df_joined_summarized.select_dtypes(['category']).columns

#convert all categorical variables to numeric
df_joined_summarized[list_cat] = df_joined_summarized[list_cat].apply(lambda x: pd.factorize(x)[0])

# Perform data splitting
y_label = 'age_adjusted_death_rate'
features = df_joined_summarized.drop(columns=[y_label]).columns.tolist()
X = df_joined_summarized[features]
y = df_joined_summarized[y_label]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# data training set
df_train = pd.DataFrame(X_train.copy())
df_train = df_train.join(pd.DataFrame(y_train))

4.1.1 Linear Regression

Select model

# select the linear regression model
reg_lin = LinearRegression()

Training and validation

# cross-validation with 5 folds only on median_household_income since we use linear regression
scores_lin = cross_val_score(reg_lin, X_train[['median_household_income']], y_train, cv=5, scoring='neg_mean_squared_error') *-1

Fit model

# Fit the model to the median_household_income feature in the training data
reg_lin.fit(X_train[['median_household_income']], y_train)
LinearRegression()

Evaluation on test set

# obtain predictions
y_pred_lin = reg_lin.predict(X_test[['median_household_income']])

Save metrics for comparison

Code
# R squared
r2_lin = r2_score(y_test, y_pred_lin).round(3)
# MSE
MSE_lin = mean_squared_error(y_test, y_pred_lin).round(3)
# RMSE
RMSE_lin = mean_squared_error(y_test, y_pred_lin, squared=False).round(3)
# MAE
MAE_lin = mean_absolute_error(y_test, y_pred_lin).round(3)

Further analysis of the linear regression model outputs can be found in the appendix.

4.1.2 Lasso Regression

First we need to standardize our numerical features, in order to use lasso regression

Code
# Make copies of the X matrices
X_train_las = X_train.copy()
X_test_las = X_test.copy()

# standardize the dataframes in order for lasso regression to work
scaler = StandardScaler().fit(X_train_las[list_num]) 

X_train_las[list_num] = scaler.transform(X_train_las[list_num])
X_test_las[list_num] = scaler.transform(X_test_las[list_num])

Next we let the algorithm choose which alpha (the regularization strength, which controls how many features the model effectively keeps) provides the best results.

Code
# Lasso with 5 fold cross-validation
reg_las = LassoCV(cv=5, random_state=0, max_iter=10000)

# Fit model
reg_las.fit(X_train_las, y_train)

# Set best alpha
reg_las = Lasso(alpha=reg_las.alpha_)

# show best alpha
reg_las
Lasso(alpha=0.7649445649117792)

The low value for alpha means that the lasso regression shrinks the coefficients only mildly and keeps more features in the model.
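To illustrate the role of alpha, here is a minimal sketch on synthetic data (not our project data): a larger alpha zeroes out more coefficients than a smaller one.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: y depends strongly on the first two features only
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 0.1 * rng.randn(200)

# Fit lasso with a small and a large regularization strength
small = Lasso(alpha=0.01).fit(X, y)
large = Lasso(alpha=1.0).fit(X, y)

# Count nonzero coefficients: larger alpha keeps at most as many features
n_small = int(np.sum(small.coef_ != 0))
n_large = int(np.sum(large.coef_ != 0))
print(n_small, n_large)
```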

We then perform the same steps as we did with the linear regression model.

Code
# cross-validation with 5 folds
scores_las = cross_val_score(reg_las, X_train_las, y_train, cv=5, scoring='neg_mean_squared_error') *-1
# Fit the model to the complete training data
reg_las.fit(X_train_las, y_train)
# obtain predictions
y_pred_las = reg_las.predict(X_test_las)
# R squared
r2_las = r2_score(y_test, y_pred_las).round(3)
# MSE
MSE_las = mean_squared_error(y_test, y_pred_las).round(3)
# RMSE
RMSE_las = mean_squared_error(y_test, y_pred_las, squared=False).round(3)
# MAE
MAE_las = mean_absolute_error(y_test, y_pred_las).round(3)

Further analysis of the lasso regression model outputs can be found in the appendix.

4.1.3 Polynomial Regression

For polynomial regression we need to expand the features with polynomial terms (here of degree 2) using fit_transform()

poly_reg = PolynomialFeatures(degree=2)
X_poly = poly_reg.fit_transform(X_train)
X_test_poly = poly_reg.transform(X_test)
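Independent of our data, this is what the degree-2 transformation does to a single sample [a, b]: it becomes [1, a, b, a², ab, b²].

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Degree-2 expansion of one sample [a, b] = [2, 3]
poly = PolynomialFeatures(degree=2)
expanded = poly.fit_transform(np.array([[2.0, 3.0]]))
print(expanded)  # [[1. 2. 3. 4. 6. 9.]]
```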

We then perform the same steps as with the previous models:

Code
reg_poly = LinearRegression()
reg_poly.fit(X_poly,y_train)
# obtain predictions
y_pred_poly = reg_poly.predict(X_test_poly)
# R squared
r2_poly = r2_score(y_test, y_pred_poly).round(3)
# MSE
MSE_poly = mean_squared_error(y_test, y_pred_poly).round(3)
# RMSE
RMSE_poly = mean_squared_error(y_test, y_pred_poly, squared=False).round(3)
# MAE
MAE_poly = mean_absolute_error(y_test, y_pred_poly).round(3)

4.1.4 Model comparison

Model R2 MSE RMSE MAE
0 Linear Regression 0.222 7051.781 83.975 68.649
1 Lasso Regression 0.222 7052.469 83.979 68.616
2 Polynomial Regression 0.317 6191.236 78.684 63.386

We can see that the polynomial regression model performs best given our data. The Results section gives a more in-depth analysis of the model results.

4.1.5 Save models

Code
ts = time.time()
timestamp = datetime.datetime.fromtimestamp(ts).strftime('_%Y-%m-%d-%H_%M_%S')

joblib.dump(reg_lin, "../models/reg_lin_model" + timestamp + ".pkl")
joblib.dump(reg_las, "../models/reg_las_model" + timestamp + ".pkl")
joblib.dump(reg_poly, "../models/reg_poly_model" + timestamp + ".pkl")
['../models/reg_poly_model_2023-01-11-19_04_13.pkl']

5 Results

We have seen that polynomial regression performs the best given our data.

Model R2 MSE RMSE MAE
0 Polynomial Regression 0.317 6191.236 78.684 63.386

Although these metrics are overall the best out of the three models we ran, there is still a lot of room for improvement.

  • The error values (MSE, RMSE, MAE) show that the data spreads a lot around the fitted line.
  • R² is a measure of how well the model fits the data; more specifically, how much of the variation of the outcome variable (in our case the age-adjusted death rate) can be explained by the predictor (the median household income).
  • An R² score of 0.317 means that 31.7 % of the variation in the age-adjusted death rate can be explained by solely observing the median household income.
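The R² computation behind these statements can be sketched on toy values (not our data), following R² = 1 - SS_res / SS_tot:

```python
import numpy as np

# Toy observed and predicted values, purely for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))  # → 0.975
```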

For a variable as complex as death rates, we consider this to be at least a decent model when taking the median household income as the sole predictor.

The plot underneath shows the final fit made by the polynomial regression. Ignoring the jagged line (a smoothing failure on our part), it depicts a quadratic fit to the data. It could be that a higher-degree polynomial is a better fit for the relation between income and death rate, or that the death rate is too complex to be described by only one variable (which we strongly assume).

Plot

Code
# Make new dataframes for the altair charts
df_scatter = X_test.join(y_test)
# Make a series, set the index to the y_test series, make dataframe to join with X_test
series_y_pred_poly = pd.Series(y_pred_poly)
series_y_pred_poly.index = y_test.index
df_y_pred_poly = pd.DataFrame(series_y_pred_poly)
df_y_pred_poly.rename(columns = {0:'age_adjusted_death_rate'}, inplace = True)
df_line_poly = X_test.join(df_y_pred_poly)

# Make plots
scatter = alt.Chart(df_scatter).mark_circle(size=60).encode(
    alt.X('median_household_income', title='Median household income ($)', scale = alt.Scale(zero = False)),
    alt.Y('age_adjusted_death_rate' , title='Age-adjusted death rate', scale = alt.Scale(zero = False)),
    tooltip = ['median_household_income', 'age_adjusted_death_rate']
).interactive()

line_poly = alt.Chart(df_line_poly).mark_line(color= 'red').encode(
    alt.X('median_household_income', title='Median household income ($)', scale = alt.Scale(zero = False)),
    alt.Y('age_adjusted_death_rate' , title='Age-adjusted death rate', scale = alt.Scale(zero = False)),
    tooltip = ['median_household_income', 'age_adjusted_death_rate']
).interactive()

# Put plots together
plot_poly = scatter + line_poly
plot_poly

6 Discussion + Conclusion

The correlation matrix in the relationships section as well as the scatterplot in the data exploration showed that our hypothesis from the beginning is supported by the data. We have seen that the negative correlation between the age-adjusted death rate and median household income can be quantified with a value of r = -0.51. The polynomial regression model was the best one we tried, although it still had a low R². Improvements could be made by trying out more models or adding more features (feature engineering / model ensembling).

With the models we used, our research question can be answered with a yes. The impact is quantified by the metric R², and we consider a value of 0.317 to be meaningful.

6.1 Outlook

We have also seen a significant correlation between the age-adjusted death rate and the years. This could be further analyzed and additional time series models could be built.

In the appendix we made a scatterplot for every death cause related to the median household income and saw that in certain states, certain death causes are more prominent than others. So the correlation between death rate and median household income could be interesting regarding the specific death cause.

7 Appendix

7.1 Additional Information regarding the research question

Our research question is backed by the following studies:

  • KINGE, Jonas Minet, et al. Association of household income with life expectancy and cause-specific mortality in Norway, 2005-2015. Jama, 2019, 321. Jg., Nr. 19, S. 1916-1925. (https://jamanetwork.com/journals/jama/article-abstract/2733322)

  • KAPLAN, George A., et al. Inequality in income and mortality in the United States: analysis of mortality and potential pathways. Bmj, 1996, 312. Jg., Nr. 7037, S. 999-1003. (https://www.bmj.com/content/312/7037/999.full)

  • O’CONNOR, Gerald T., et al. Median household income and mortality rate in cystic fibrosis. Pediatrics, 2003, 111. Jg., Nr. 4, S. e333-e339. (https://publications.aap.org/pediatrics/article-abstract/111/4/e333/63113/Median-Household-Income-and-Mortality-Rate-in)

Although the first study was done in Norway and the second study investigates mortality instead of death rate, we expect to make similar observations.

Added information on mortality rate:

Mortality is a fact that refers to susceptibility to death. While there is a crude death rate that refers to number of deaths in a population in a year, mortality rate is the number of deaths per thousand people over a period of time that is normally a year. (see: https://www.differencebetween.com/difference-between-death-rate-and-vs-mortality-rate/).

7.2 Imputation Analysis

Code
df_imputation = pd.read_csv(processed_data + 'Joined_Dataset.csv')
Code
mno.matrix(df_imputation, figsize = (3, 2))
(Plot: missing-value matrix of the joined dataset)

After joining, the median_household_income is now an additional column for the death data set.

We only have the median household income for the year 1990, 2000, 2005, 2010 and 2013-2017. In the death dataset, we find the years starting from 1999 until 2017. That is the reason why we only have 4576 non-null values for the median household income. This means that roughly 58 % of the column has empty values which we need to either fill (by imputing) or remove.

Since removing them would cut our dataset by more than half, that is not a viable option.
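The quoted share of missing values follows directly from the non-null counts shown above:

```python
# 4576 of 10868 rows carry an income value (see the info() output above)
total_rows = 10868
non_null_income = 4576
missing_share = 1 - non_null_income / total_rows
print(round(missing_share * 100, 1))  # → 57.9
```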

First, let's take a look at the summary statistics for the median household income and its distribution.

7.2.1 Summary Statistics

Code
# Dataframe with only median household income as column
df_household_income = df_imputation[['median_household_income']]

# Summary statistics
df_household_income.describe().round(2).T
count mean std min 25% 50% 75% max
median_household_income 4576.0 58179.33 9580.27 40000.0 51050.0 56350.0 64000.0 82400.0
Code
# Interquartile range
q1 = df_household_income.quantile(q = 0.25)
q3 = df_household_income.quantile(q = 0.75)
iqr = q3-q1
iqr
median_household_income    12950.0
dtype: float64

7.2.2 Distribution

Code
# Distribution of median household income
df_household_income.plot(kind='kde', figsize=(8, 5), title='Distribution of median household income before imputation')
plt.xlabel('median household income')
(Plot: density of the median household income before imputation)

7.2.3 Plot

Code
df_imputation.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10868 entries, 0 to 10867
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   year                     10868 non-null  int64  
 1   113_cause_name           10868 non-null  object 
 2   cause_name               10868 non-null  object 
 3   state                    10868 non-null  object 
 4   deaths                   10868 non-null  int64  
 5   age_adjusted_death_rate  10868 non-null  float64
 6   median_household_income  4576 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 594.5+ KB
Code
# Scatterplot income over years before imputation
scatter_income_before = alt.Chart(df_imputation).mark_circle(size=60).encode(
    x=alt.X('year', 
            title='Year',
            scale=alt.Scale(domain = (1999,2018))
            ),
    y=alt.Y('median_household_income', 
            title='Median Household Income ($)',
            scale=alt.Scale(zero=False)
            ),
    tooltip=['year', 'median_household_income'],
)
scatter_income_before.title ='Median Household Income over the years'
scatter_income_before.interactive()
Code
# Boxplot over all income values before imputation
box_income_before = alt.Chart(df_household_income).mark_boxplot(size = 50).encode(
    x=alt.X('median_household_income', title= 'median household income', scale = alt.Scale(zero = False))
).properties(
    width = 450,
    height = 80)
box_income_before

The distribution is unimodal and right skewed, extending from about 40,000 $ to over 80,000 $.

The median lies at 56,350 $ and, with an IQR of 12,950 $, there are no outliers in the income data.

Note also that the values for the years 1999, 2001-2004, 2006-2009, 2011 and 2012 are missing.
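The no-outlier claim follows the 1.5 × IQR (Tukey fence) rule, which can be checked directly with pandas. A minimal sketch with a small hypothetical income sample standing in for df_household_income (which is built earlier in the notebook):

```python
import pandas as pd

# Hypothetical income sample standing in for df_household_income
income = pd.Series([40000, 48000, 52000, 56350, 60000, 65000, 80000],
                   name='median_household_income')

q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1

# Tukey fences: observations outside [q1 - 1.5*iqr, q3 + 1.5*iqr] count as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = income[(income < lower) | (income > upper)]
print(outliers.empty)  # True -> no outliers by the 1.5*IQR rule
```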

7.2.4 Imputed dataset

Let us take a look what the imputation has done for the missing values and how we can use the result:

Summary statistics

Code
# Compare the original and imputed median income column
df_income = df_imputation[['median_household_income']].copy()
df_income['income_KNN_Scaled'] = imputed_scaled['median_household_income']

# Obtain summary statistics
df_income.describe().T[['mean', 'std', 'min', '50%', 'max']]
mean std min 50% max
median_household_income 58179.326923 9580.271444 40000.0 56350.0 82400.0
income_KNN_Scaled 57765.145381 9375.632377 40000.0 55800.0 82400.0
Code
# Interquartile range
q1 = df_income.quantile(q = 0.25)
q3 = df_income.quantile(q = 0.75)
iqr = q3-q1
iqr
median_household_income    12950.0
income_KNN_Scaled          13000.0
dtype: float64

7.2.5 Distribution

Code
# Distribution of median household income
df_income.plot(kind='kde', figsize=(8, 5), title='Distribution of median household income after imputation')
plt.xlabel('median household income')
Text(0.5, 0, 'median household income')

7.2.6 Plot

Code
# Scatterplot income over years after imputation
scatter_income_after = alt.Chart(imputed_scaled).mark_circle(size=60).encode(
    x=alt.X('year', 
            title='Year',
            scale=alt.Scale(domain = (1999,2018))
            ),
    y=alt.Y('median_household_income', 
            title='Median Household Income ($)',
            scale=alt.Scale(zero=False)
            ),
    tooltip=['year', 'median_household_income']
).interactive()

# Vertical concatenation of scatterplots before and after imputation
alt.vconcat(scatter_income_before, scatter_income_after)
Code
# Boxplot over all income values after imputation
box_income_after = alt.Chart(imputed_scaled).mark_boxplot(size = 50).encode(
    x=alt.X('median_household_income', title='median household income', scale=alt.Scale(zero=False))
).properties(
    width = 450,
    height = 80)

# Vertical concatenation of boxplots before and after imputation
chart = alt.vconcat(box_income_before, box_income_after)

chart = chart.configure_view(
    strokeWidth=0
).configure_title(
    fontSize=20,
    font='Courier',
    anchor='start',
    offset=20
)
chart.title = 'Income boxplot before and after imputation'
chart

From the summary statistics we can see that the imputed values represent the original data quite well: mean, standard deviation and median stay close to the original values. The IQR is slightly larger, but there are still no outliers after the imputation. This is explained by the method we used, since the KNN algorithm imputes each missing value from the distances to existing neighbouring data points.

We found that setting the number of neighbours (n_neighbors) to 1 makes the imputed data represent the original data best without introducing additional assumptions. Since we are not domain experts, we want to change the data as little as possible.

The scatterplot shows that the missing median household income values have been copied from years with existing values. For example, the years 1999 - 2002 all share the values from 2000, 2003 - 2007 those from 2005, and 2008 - 2011 those from 2010; 2012 has been copied from 2013. This has the advantage that it maintains the spread in the data. The disadvantage is the implicit assumption that the median household income by state stayed the same for the years close to 2000, 2005 and 2010.

This is visible in the very similar boxplots before and after imputation.

In reality the median income almost certainly did not stay constant, but for our purposes this gives a decent estimate for training our model.
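The imputation step described above can be sketched as follows. The column names match the notebook, but the data here is a small hypothetical example; scaling is shown because KNNImputer is distance-based:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# Small hypothetical frame standing in for the notebook's data:
# median household income is missing (NaN) for some years
df = pd.DataFrame({
    'year': [1999, 2000, 2001, 2005, 2006, 2010],
    'median_household_income': [np.nan, 42000.0, np.nan, 50000.0, np.nan, 56000.0],
})

# Scale first: KNN distances are only meaningful on comparable feature scales
scaler = MinMaxScaler()
scaled = scaler.fit_transform(df)

# n_neighbors=1 copies each missing value from its single nearest neighbour,
# which preserves the spread of the observed data
imputer = KNNImputer(n_neighbors=1)
imputed = scaler.inverse_transform(imputer.fit_transform(scaled))

imputed_df = pd.DataFrame(imputed, columns=df.columns)
print(imputed_df)  # 1999 receives the income observed for 2000, etc.
```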

This leaves us with 6 columns in total for our data analysis (we will later disregard the 113_cause_name column since, as we will see, the same information is provided by the cause_name column).

7.3 Additional data exploration

The scatterplots for the relation between median household income and age-adjusted death rate look like this for every single death cause:

Code
# Helper that builds one income-vs-death-rate scatterplot for a single cause of death
def cause_chart(cause):
    return alt.Chart(df_imputation).mark_circle(size=60).encode(
        alt.X('median_household_income', scale=alt.Scale(zero=False)),
        alt.Y('age_adjusted_death_rate', scale=alt.Scale(zero=False)),
        color='cause_name',
        tooltip=['median_household_income', 'age_adjusted_death_rate']
    ).properties(
        width=150,
        height=150
    ).transform_filter(
        alt.FieldEqualPredicate(field='cause_name', equal=cause)
    ).interactive()

Chart_CLRD = cause_chart('CLRD')
Chart_Alzheimer = cause_chart("Alzheimer's disease")
Chart_Influenza = cause_chart('Influenza and pneumonia')
Chart_Suicide = cause_chart('Suicide')
Chart_Kidney = cause_chart('Kidney disease')
Chart_Unintentional = cause_chart('Unintentional injuries')
Chart_Diabetes = cause_chart('Diabetes')
Chart_Stroke = cause_chart('Stroke')
Chart_Cancer = cause_chart('Cancer')
Chart_Heart = cause_chart('Heart disease')

Horizontal1 = Chart_CLRD | Chart_Alzheimer | Chart_Diabetes
Horizontal2 = Chart_Kidney | Chart_Suicide | Chart_Influenza
Horizontal3 = Chart_Stroke | Chart_Cancer | Chart_Heart | Chart_Unintentional
chart = alt.vconcat(Horizontal1, Horizontal2, Horizontal3,
                    title='Age-adjusted death rate vs. median household income by cause of death')
chart.display()

These charts show each death cause in detail, one chart per cause. The same negative correlation as in the plot for all death causes is visible.

It seems that the death cause only slightly influences this correlation. Cancer appears to be the cause with the strongest negative correlation, which could be used for further analysis and model training focused on a specific death cause.
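The visual impression can be quantified by computing the Pearson correlation per cause. A sketch with hypothetical data standing in for df_imputation (the slopes are invented to illustrate a strong vs. a weak relation):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for df_imputation with two causes of death:
# 'Cancer' gets a strong negative slope, 'Suicide' a weak one
income = rng.uniform(40000, 82000, size=200)
df = pd.DataFrame({
    'cause_name': ['Cancer'] * 100 + ['Suicide'] * 100,
    'median_household_income': income,
    'age_adjusted_death_rate': np.where(
        np.arange(200) < 100,
        180 - 0.001 * income + rng.normal(0, 5, 200),
        15 - 0.00005 * income + rng.normal(0, 2, 200),
    ),
})

# Pearson correlation of death rate with income, per cause of death
corr_by_cause = {
    cause: g['median_household_income'].corr(g['age_adjusted_death_rate'])
    for cause, g in df.groupby('cause_name')
}
for cause, r in sorted(corr_by_cause.items(), key=lambda kv: kv[1]):
    print(f'{cause}: {r:.2f}')
```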

Code
chart=alt.Chart(df_imputation).mark_circle().encode(
    alt.X('year', scale=alt.Scale(zero=False)),
    alt.Y('age_adjusted_death_rate', title= 'age-adjusted death rate', scale=alt.Scale(zero=False, padding=1)),
    color='state', 
)
chart.title = "Age-adjusted death rate in different years and states"
chart

This chart could be extended to visualize the death rate per state and year in more detail.

Code
chart=alt.Chart(df_imputation).mark_circle().encode(
    alt.X('year', scale=alt.Scale(zero=False)),
    alt.Y('median_household_income', title= 'median household income', scale=alt.Scale(zero=False, padding=1)),
    color='state', 
)
chart.title = "Development of median household income over the years by states"
chart
Code
chart=alt.Chart(df_imputation).mark_bar(opacity=0.7).encode(
    x=alt.X('cause_name:O', title = 'cause name'), 
    y=alt.Y('age_adjusted_death_rate:Q', title= 'age-adjusted death rate', stack=None),
    color="state",
    tooltip=['state']
    ).transform_filter(
    {'not':alt.FieldEqualPredicate(field='cause_name', equal='All causes')}
)
chart.title = "Causes for age-adjusted death rates in different states"
chart

This graph shows that heart disease in Arizona and cancer in South Dakota are by far the top death causes in terms of death rate. Recall that cancer has a particularly strong negative correlation with the median income.

7.4 Additional relation analysis

Code
#identify all categorical variables
list_cat = df_explore.select_dtypes(['category']).columns
#convert all categorical variables to numeric
df_explore[list_cat] = df_explore[list_cat].apply(lambda x: pd.factorize(x)[0])
Code
# take a look at all correlations
corr = df_explore.corr()
corr.style.background_gradient(cmap='Blues')
  year 113_cause_name cause_name state deaths median_household_income age_adjusted_death_rate
year 1.000000 0.005480 0.005480 0.002946 0.013353 -0.077663 -0.027888
113_cause_name 0.005480 1.000000 1.000000 0.007820 0.097992 -0.015742 0.402470
cause_name 0.005480 1.000000 1.000000 0.007820 0.097992 -0.015742 0.402470
state 0.002946 0.007820 0.007820 1.000000 0.004487 -0.023781 0.005450
deaths 0.013353 0.097992 0.097992 0.004487 1.000000 0.005334 0.224213
median_household_income -0.077663 -0.015742 -0.015742 -0.023781 0.005334 1.000000 -0.031676
age_adjusted_death_rate -0.027888 0.402470 0.402470 0.005450 0.224213 -0.031676 1.000000

We can see that the features 113_cause_name and cause_name provide exactly the same information (their correlation is 1.0). As predicted, we only need to keep one of them for the model, so we will drop the column ‘113_cause_name’.

We can also ignore the expected correlations between age_adjusted_death_rate and cause_name or deaths, as well as between deaths and cause_name, since those do not provide extra information.
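The redundancy check and the drop boil down to a few lines. A sketch with a small hypothetical frame standing in for df_explore after factorization:

```python
import pandas as pd

# Hypothetical stand-in: two factorized columns carrying the same information
df_explore = pd.DataFrame({
    '113_cause_name': [0, 1, 1, 2, 0],
    'cause_name':     [0, 1, 1, 2, 0],
    'deaths':         [10, 20, 30, 40, 50],
})

# A correlation of 1.0 confirms the columns are redundant
assert abs(df_explore['113_cause_name'].corr(df_explore['cause_name']) - 1.0) < 1e-12

# Keep only one of the two for the model
df_explore = df_explore.drop(columns=['113_cause_name'])
print(list(df_explore.columns))  # ['cause_name', 'deaths']
```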

7.5 Additional model result analysis

7.5.1 Linear Regression

Cross Validation

Code
# store cross-validation scores
df_scores_lin = pd.DataFrame({"lr": scores_lin})

# reset index to match the number of folds
df_scores_lin.index += 1

# visualize MSE for each Fold
chart=alt.Chart(df_scores_lin.reset_index()).mark_line(
     point=alt.OverlayMarkDef()
).encode(
    x=alt.X("index", bin=False, title="Fold", axis=alt.Axis(tickCount=5)),
    y=alt.Y("lr", aggregate="mean", title="Mean squared error (MSE)")
)
chart.title = "MSE by Folds (linear regression)"
chart
Code
df_scores_lin.describe().T
count mean std min 25% 50% 75% max
lr 5.0 7462.394258 1197.570953 6063.244422 6370.508411 7715.406685 8485.279789 8677.531982
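The scores_lin values come from an earlier cell of the notebook; a plausible reconstruction uses sklearn's cross_val_score with hypothetical training data in place of X_train and y_train (sklearn reports MSE as a negative score, hence the leading minus):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)

# Hypothetical training data standing in for X_train / y_train
X_train = pd.DataFrame({'median_household_income': rng.uniform(40000, 82000, 500)})
y_train = 1087 - 0.005 * X_train['median_household_income'] + rng.normal(0, 80, 500)

# 5-fold cross-validation; negate the scores to obtain positive MSE values
scores_lin = -cross_val_score(LinearRegression(), X_train, y_train,
                              cv=5, scoring='neg_mean_squared_error')
print(scores_lin.round(1))
```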

Coefficients

Code
# intercept
intercept = pd.DataFrame({
    "Name": ["Intercept"],
    "Coefficient":[reg_lin.intercept_]}
    )

# make a slope table
slope = pd.DataFrame({
    "Name": ["median_household_income"],
    "Coefficient": reg_lin.coef_}
)

# combine estimates of intercept and slopes
table = pd.concat([intercept, slope], ignore_index=True, sort=False)

round(table, 3)
Name Coefficient
0 Intercept 1087.116
1 median_household_income -0.005
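With the fitted coefficients, a prediction is just intercept plus slope times income. For example, at the median income of 56,350 $ (coefficient values copied from the table above, rounded):

```python
# Coefficients copied from the table above (rounded)
intercept = 1087.116
slope = -0.005

# Predicted age-adjusted death rate at the median income of 56,350 $
income = 56350
predicted_rate = intercept + slope * income
print(round(predicted_rate, 2))  # 805.37
```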

Plot

Code
# Make new dataframes for the altair charts
# Make a series, set the index to the y_test series, make dataframe to join with X_test
series_y_pred_lin = pd.Series(y_pred_lin)
series_y_pred_lin.index = y_test.index
df_y_pred_lin = pd.DataFrame(series_y_pred_lin)
df_y_pred_lin.rename(columns = {0:'age_adjusted_death_rate'}, inplace = True)
df_line_lin = X_test.join(df_y_pred_lin)

# Make plots
line_lin = alt.Chart(df_line_lin).mark_line(color= 'red').encode(
    alt.X('median_household_income', title='Median household income ($)', scale = alt.Scale(zero = False)),
    alt.Y('age_adjusted_death_rate' , title='Age-adjusted death rate', scale = alt.Scale(zero = False)),
    tooltip = ['median_household_income', 'age_adjusted_death_rate']
).interactive()

# Put plots together
plot_lin = scatter + line_lin
plot_lin.title = "Scatter chart with linear regression"
plot_lin

7.5.2 Lasso Regression

Cross Validation

Code
# store cross-validation scores
df_scores_las = pd.DataFrame({"lr": scores_las})

# reset index to match the number of folds
df_scores_las.index += 1

# visualize MSE for each Fold
chart=alt.Chart(df_scores_las.reset_index()).mark_line(
     point=alt.OverlayMarkDef()
).encode(
    x=alt.X("index", bin=False, title="Fold", axis=alt.Axis(tickCount=5)),
    y=alt.Y("lr", aggregate="mean", title="Mean squared error (MSE)")
)
chart.title = "MSE by Folds (lasso regression)"
chart
Code
df_scores_las.describe().T
count mean std min 25% 50% 75% max
lr 5.0 7507.486421 1198.093726 6070.625075 6469.607034 7727.331895 8566.590051 8703.278049

Coefficients

Code
# intercept
intercept = pd.DataFrame({
    "Name": ["Intercept"],
    "Coefficient":[reg_las.intercept_]}
    )

# make a slope table
slope = pd.DataFrame({
    "Name": features,
    "Coefficient": reg_las.coef_}
)

# combine estimates of intercept and slopes
table = pd.concat([intercept, slope], ignore_index=True, sort=False)

round(table, 3)
Name Coefficient
0 Intercept 802.619
1 state -0.007
2 median_household_income -46.177

This shows that the lasso regression effectively relies on the same feature as the linear regression: the coefficient for state is shrunk almost to zero, so the model is driven essentially by the median household income alone.
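Lasso's tendency to shrink uninformative coefficients toward zero can be demonstrated on synthetic data (the features and the alpha value here are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 500

# One informative feature and one pure-noise feature
informative = rng.normal(0, 1, n)
noise_feature = rng.normal(0, 1, n)
X = np.column_stack([informative, noise_feature])
y = 3.0 * informative + rng.normal(0, 0.5, n)

# Standardise so the L1 penalty treats both features equally
X = StandardScaler().fit_transform(X)

reg = Lasso(alpha=0.5).fit(X, y)
print(reg.coef_.round(3))  # coefficient of the noise feature is shrunk to ~0
```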

Plot

Code
# Make new dataframes for the altair charts
# Make a series, set the index to the y_test series, make dataframe to join with X_test
series_y_pred_las = pd.Series(y_pred_las)
series_y_pred_las.index = y_test.index
df_y_pred_las = pd.DataFrame(series_y_pred_las)
df_y_pred_las.rename(columns = {0:'age_adjusted_death_rate'}, inplace = True)
df_line_las = X_test.join(df_y_pred_las)

# Make plots
line_las = alt.Chart(df_line_las).mark_line(color= 'red').encode(
    alt.X('median_household_income', title='Median household income ($)', scale = alt.Scale(zero = False)),
    alt.Y('age_adjusted_death_rate' , title='Age-adjusted death rate', scale = alt.Scale(zero = False)),
    tooltip = ['median_household_income', 'age_adjusted_death_rate']
).interactive()

# Put plots together
plot_las = scatter + line_las
plot_las.title = "Scatter chart with lasso regression"
plot_las

7.6 Further ideas for visualizations and analysis

Further analysis:

  • How is the median income distributed year by year in South Dakota?

  • How is the household income developing year by year in the different states? Are there high variances? Is there a relationship to the death rates?

The following models could be interesting, because the models we have chosen were not ideal:

  • Bayesian Regression

  • Decision Tree Regression, mainly xgboost

  • Gradient Descent Regression

In order to find the best performing model, the candidates could be compared using a specific metric (e.g. R², mean squared error or mean absolute error).
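Such a comparison could be set up as a simple loop over candidate models, scoring each with the same cross-validated metric. A sketch with hypothetical data in place of the project's features and target (xgboost is omitted here in favour of sklearn's built-in tree regressor):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import BayesianRidge, Lasso, LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Hypothetical data standing in for the project's feature and target
X = pd.DataFrame({'median_household_income': rng.uniform(40000, 82000, 300)})
y = 1087 - 0.005 * X['median_household_income'] + rng.normal(0, 80, 300)

models = {
    'linear': LinearRegression(),
    'lasso': Lasso(alpha=1.0, max_iter=10000),
    'bayesian_ridge': BayesianRidge(),
    'decision_tree': DecisionTreeRegressor(max_depth=3, random_state=0),
}

# Mean cross-validated MSE per model (lower is better)
results = {name: -cross_val_score(m, X, y, cv=5,
                                  scoring='neg_mean_squared_error').mean()
           for name, m in models.items()}
for name, mse in sorted(results.items(), key=lambda kv: kv[1]):
    print(f'{name}: {mse:.1f}')
```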

Also, a model ensemble might perform even better, since it combines the strengths of different models.

It could also be interesting to carry out model fits at the level of individual years and compare the metrics with each other, in order to make statements about developments over the years.

Another option is to gather more data, perform feature engineering and build model ensembles to improve overall performance.